# Multimodal Model

## Spaceom GGUF
**Author:** mgonzs13 · **License:** Apache-2.0 · **Tags:** Text-to-Image, English · **Downloads:** 196 · **Likes:** 1

SpaceOm-GGUF is a multimodal model focused on visual question answering, with particularly strong spatial reasoning.
## Qwen2 VL 7B Captioner Relaxed GGUF
**Author:** r3b31 · **License:** Apache-2.0 · **Tags:** Image-to-Text, English · **Downloads:** 321 · **Likes:** 1

A GGUF-format conversion of Qwen2-VL-7B-Captioner-Relaxed, optimized for image-to-text tasks and runnable with tools such as llama.cpp and Koboldcpp.
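GGUF builds like the one above are typically driven through llama.cpp's multimodal command-line tool. A minimal sketch of assembling such an invocation, assuming current llama.cpp conventions (a `llama-mtmd-cli` binary that takes a separate `--mmproj` projector file); all file names are hypothetical placeholders:

```python
# Hedged sketch: build the argv list for a one-shot llama.cpp multimodal
# generation. Flag names follow llama.cpp's llama-mtmd-cli; adjust for
# your build if the tool or flags differ.
import subprocess

def build_llama_cpp_cmd(model_path, mmproj_path, image_path, prompt):
    """Return the argv list for a one-shot multimodal generation."""
    return [
        "llama-mtmd-cli",
        "-m", model_path,         # main language-model GGUF
        "--mmproj", mmproj_path,  # vision projector GGUF shipped alongside
        "--image", image_path,
        "-p", prompt,
    ]

cmd = build_llama_cpp_cmd(
    "qwen2-vl-7b-captioner-relaxed.Q4_K_M.gguf",  # placeholder file name
    "mmproj-qwen2-vl-7b.gguf",                    # placeholder projector
    "photo.jpg",
    "Describe this image in detail.",
)
# subprocess.run(cmd, check=True)  # uncomment once the files exist locally
print(" ".join(cmd))
```

Keeping the command as a list (rather than a shell string) avoids quoting issues when the prompt contains spaces or special characters.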
## Vit GPT2 Image Captioning
**Author:** motheecreator · **Tags:** Image-to-Text, Transformers · **Downloads:** 149 · **Likes:** 0

An image captioning model based on the ViT-GPT2 architecture, capable of generating natural language descriptions for input images.
## Vit GPT2 Image Captioning
**Author:** mo-thecreator · **Tags:** Image-to-Text, Transformers · **Downloads:** 17 · **Likes:** 0

An image captioning model based on the ViT-GPT2 architecture, capable of generating natural language descriptions for input images.
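Several entries in this listing (the ViT-GPT2 captioners, Swin AraGPT2, Vitgpt2 Vizwiz) follow the transformers VisionEncoderDecoder pattern, which is normally driven through the `image-to-text` pipeline. A minimal sketch, assuming transformers and torch are installed; the repo id is a placeholder to be replaced with the actual checkpoint:

```python
def load_captioner(model_id):
    """Build an image-to-text pipeline for a VisionEncoderDecoder checkpoint.

    model_id is a placeholder; substitute the Hub repo id of the ViT-GPT2
    checkpoint you want (e.g. one of the captioning models listed here).
    """
    from transformers import pipeline  # heavy dependency, imported lazily
    return pipeline("image-to-text", model=model_id)

def extract_caption(pipeline_output):
    # image-to-text pipelines return a list like [{"generated_text": "..."}]
    return pipeline_output[0]["generated_text"].strip()

# Usage (downloads weights, so not run here):
#   captioner = load_captioner("<author>/<vit-gpt2-checkpoint>")
#   print(extract_caption(captioner("photo.jpg")))
```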
## Florence 2 Large TableDetection
**Author:** ucsahin · **License:** MIT · **Tags:** Image-to-Text, Transformers · **Downloads:** 1,993 · **Likes:** 18

A multimodal table detection model fine-tuned from Florence-2, capable of precisely locating table regions in images.
## Paligemma Vqav2
**Author:** merve · **Tags:** Text-to-Image, Transformers · **Downloads:** 168 · **Likes:** 13

A fine-tuned version of google/paligemma-3b-pt-224 on a subset of the VQAv2 dataset, specializing in visual question answering.
## Chexagent 2 3b
**Author:** StanfordAIMI · **Tags:** Image-to-Text, Transformers, Other · **Downloads:** 28.72k · **Likes:** 4

CheXagent is a foundation model focused on chest X-ray interpretation, designed to assist medical imaging analysis.
## Vit Base Patch16 224 Turkish Gpt2 Medium
**Author:** atasoglu · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Other · **Downloads:** 14 · **Likes:** 0

A vision encoder-decoder model combining ViT with Turkish GPT-2 to generate Turkish image captions.
## Xrayclip Vit L 14 Laion2b S32b B82k
**Author:** StanfordAIMI · **Tags:** Image-to-Text, Transformers · **Downloads:** 975 · **Likes:** 0

CheXagent is a foundation model specifically designed for chest X-ray interpretation, capable of automatically analyzing and interpreting chest X-ray images.
## Chartllama 13b
**Author:** listen2you002 · **License:** Apache-2.0 · **Tags:** Large Language Model, Transformers, English · **Downloads:** 144 · **Likes:** 19

ChartLlama is a multimodal model based on the LLaVA-1.5 architecture, specializing in chart understanding and analysis.
## Blip Image Captioning Base Test Sagemaker Tops 3
**Author:** GHonem · **License:** BSD-3-Clause · **Tags:** Image-to-Text, Transformers · **Downloads:** 13 · **Likes:** 0

A fine-tuned version of Salesforce's BLIP image-captioning base model trained on the SageMaker platform, primarily used for image caption generation.
## Swin Aragpt2 Image Captioning V3
**Author:** AsmaMassad · **Tags:** Image-to-Text, Transformers · **Downloads:** 18 · **Likes:** 0

An image captioning model based on the Swin Transformer and AraGPT2 architectures, capable of generating textual descriptions for input images.
## Saved Model Git Base
**Author:** holipori · **License:** MIT · **Tags:** Image-to-Text, Transformers, Other · **Downloads:** 13 · **Likes:** 0

A vision-language model fine-tuned from microsoft/git-base on an image-folder dataset, primarily used for image caption generation.
## Blip2 Flan T5 Xl Sharded
**Author:** ethzanalytics · **License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Downloads:** 71 · **Likes:** 6

A sharded version of BLIP-2 with a Flan-T5-XL language model, for image-to-text tasks such as image captioning and visual question answering. Sharding allows the checkpoint to be loaded in low-memory environments.
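The point of a sharded checkpoint is that weights can be pulled in piece by piece instead of materializing the whole model in RAM at once. A hedged sketch of low-memory loading with transformers (model id is a placeholder; assumes torch, transformers, and accelerate are installed):

```python
def load_blip2_low_memory(model_id):
    """Load a sharded BLIP-2 checkpoint without holding all weights in RAM.

    model_id is a placeholder for the sharded repo. device_map="auto"
    (provided by accelerate) dispatches shards across available devices
    as they are read; float16 halves the memory footprint.
    """
    import torch
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
        low_cpu_mem_usage=True,
    )
    return processor, model

# Usage (downloads several GB, so not run here):
#   processor, model = load_blip2_low_memory("<author>/blip2-flan-t5-xl-sharded")
```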
## Image Caption
**Author:** jaimin · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers · **Downloads:** 14 · **Likes:** 2

An image caption generation model based on the VisionEncoderDecoder architecture, converting input images into natural language descriptions.
## Clip Vit Large Patch14 Ko
**Author:** Bingsu · **License:** MIT · **Tags:** Text-to-Image, Transformers, Korean · **Downloads:** 4,537 · **Likes:** 15

A Korean CLIP model trained via knowledge distillation, supporting multimodal understanding in Korean and English.
## Layoutlmv3 Finetuned Wildreceipt
**Author:** Theivaprakasham · **Tags:** Text Recognition, Transformers · **Downloads:** 118 · **Likes:** 3

A LayoutLMv3-base model fine-tuned on the WildReceipt dataset, designed for key-information extraction from receipts.
## Vitgpt2 Vizwiz
**Author:** gagan3012 · **Tags:** Image-to-Text, Transformers · **Downloads:** 24 · **Likes:** 1

A vision-language model based on the ViT-GPT2 architecture for image-to-text tasks.
---

*Featured Recommended AI Models · AIbase — Empowering the Future, Your AI Solution Knowledge Base · © 2025 AIbase*